added line cache #678

nitinNT · 2022-10-30T08:17:43Z

No description provided.

5nord · 2022-10-30T09:45:32Z

ttcn3/v2/syntax/syntax.go

+	if t.cachedLineBegin <= pos && pos < t.cachedLineEnd {
+		return Position{
+			Line:   t.cacheLine + 1,
+			Column: pos - t.lines[t.cacheLine] + 1,
+		}
+	}


Cache-handling is distributed over two methods (tree.position and tree.searchLines). I suggest you put all cache-related code into only one function (either in tree.position or in tree.searchLines, but not both). This increases cohesion and make the code easier to reason about.

5nord · 2022-10-30T09:54:34Z

ttcn3/v2/syntax/syntax.go

+	t.cachedLineBegin = i
+	t.cachedLineEnd = j


I think those two lines are the reason why the benchmarks are not as good as expected and why the tests fail:
i and j represent lines, while cachedLineBegin and cachedLineEnd are supposed to represent (file) offsets. In line 288 you're essentially comparing lines with offsets:

if t.cachedLineBegin <= pos && pos < t.cachedLineEnd

I think those two variables should be set after you found the proper line:

Suggested change

t.cachedLineBegin = i

t.cachedLineEnd = j

i have now put the comparison steps in searchLines function as below so searchLines function will return cacheLine

if t.cachedLineBegin <= i && j < t.cachedLineEnd { return t.cacheLine }

is this correct ?

Almost. Instead of i and j you should compare with pos:

if t.cachedLineBegin <= pos && pos < t.cachedLineEnd { return t.cacheLine }

I'd like to explain a bit how location handling works, this will clear things up:

Overview

One core functionality of this packages is to provide start and end position for syntax nodes. We call this a text span and it includes a file name, line numbers, column numbers and the file offsets. We could store this information inside the node object. For example:

type Node struct { // ... Span struct { Filename string Begin, End struct { Line, Column, Offset int } } }

This approach is strait-forward and easy, but also quite wasteful. It would add 64 bytes of repetitive data to every node and sometimes we need this detailed information only for a few nodes.

So, how can we improve this situation?

Instead of storing the whole span, we store the begin and end offsets (8 bytes instead of 64) and provide an algorithm to compute the detailed information only when needed. For example the node would look like this:

type Node struct { // ... Begin, End int32 }

How do we get from offset to line numbers?

When creating the syntax tree, we keep track of newline characters (\n) and store the start-offset for every line in a slice.

Example: Lines slice

The input Foo\nBar Baz\nFoobar\n contains three lines, the first line starts is at offset 0, the second line starts at offset 4, the last line starts at offset 12. The resulting lines slice would like this:

lines := []int{0, 4, 12}

Example: Lookup

In which line is the word Baz? We know from the syntax nodes, that the word begins at offset 8.
Now we search in the lines slice to find out to which line the offset 8 belongs to:

for i := range lines { if 8 < lines[i] { return i-1 } } return len(lines)-1

We need to subtract 1 from the found index, because this loop exits at one line after the one we were actually searching for; in this case it breaks at the third line (i==2).

Binary Search

Our files have 300-1500 lines on average. As a consequence the described lookup becomes very inefficient, because it's just a linear search. We can do much better if we use a binary search algorithm instead.
Here are some articles I found explaining binary search:

Everything you need to know

Geeks For Geeks

The tree.searchLines implements such a binary search. The variables i and j are the lower- and the higher limits required for the binary search. That's why i starts at 0 and walks towards the end of the lines slice, while j starts at the end (len(t.lines)) and walks towards the beginning of the lines slice.

Caching

The caching is independent of those binary search variables. For caching we need to check if the pos offset is on the same line as the last time searchLines was called. We could do it this way:

if t.cachedLineBegin <= pos && pos < t.cachedLineEnd { return t.cacheLine }

I just realized you don't need store the offsets in cachedLineBegin and cachedLineEnd, but you can do the check directly in the lines-slice:

if len(t.lines) > 1 && t.lines[t.cachedLine] <= pos && pos < t.lines[t.cachedLine+1] { return t.cachedLine }

thanks for detailed explanation.

5nord · 2022-10-30T09:56:36Z

ttcn3/v2/syntax/syntax.go

@@ -306,6 +314,7 @@ func (t *tree) searchLines(pos int) int {
 			j = h
 		}
 	}
+	t.cacheLine = int(i) - 1


Suggested change

t.cacheLine = int(i) - 1

line := int(i) - 1

t.cachedLineBegin = t.lines[line]

t.cachedLineEnd = t.lines[line+1]

t.cacheLine = line

i have tried this but for line 319 its giving me runtime error: index out of range [1] with length 1 as line array contains only one element.

You are right. t.lines[lint+1] does not work for the last line.

Maybe we could fix it with a condition? No very elegant, I know:

if line < len(t.lines) { t.cachedLineEnd = t.lines[line+1] } else { t.cachedLineEnd = len(t.lines) }

added line cache

6149ba4

5nord marked this pull request as ready for review October 30, 2022 09:28

5nord marked this pull request as draft October 30, 2022 09:28

5nord reviewed Oct 30, 2022

View reviewed changes

modified line cache

f747af8

5nord closed this Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added line cache #678

added line cache #678

nitinNT commented Oct 30, 2022

5nord Oct 30, 2022 •

edited

Loading

5nord Oct 30, 2022

nitinNT Oct 30, 2022

5nord Oct 31, 2022 •

edited

Loading

nitinNT Nov 5, 2022

5nord Oct 30, 2022 •

edited

Loading

nitinNT Oct 30, 2022

5nord Oct 31, 2022

-	t.cacheLine = int(i) - 1
+        line := int(i) - 1
+        t.cachedLineBegin = t.lines[line]
+        t.cachedLineEnd = t.lines[line+1]
+	t.cacheLine = line

added line cache #678

added line cache #678

Conversation

nitinNT commented Oct 30, 2022

5nord Oct 30, 2022 • edited Loading

Choose a reason for hiding this comment

5nord Oct 30, 2022

Choose a reason for hiding this comment

nitinNT Oct 30, 2022

Choose a reason for hiding this comment

5nord Oct 31, 2022 • edited Loading

Choose a reason for hiding this comment

Overview

So, how can we improve this situation?

How do we get from offset to line numbers?

Binary Search

Caching

nitinNT Nov 5, 2022

Choose a reason for hiding this comment

5nord Oct 30, 2022 • edited Loading

Choose a reason for hiding this comment

nitinNT Oct 30, 2022

Choose a reason for hiding this comment

5nord Oct 31, 2022

Choose a reason for hiding this comment

5nord Oct 30, 2022 •

edited

Loading

5nord Oct 31, 2022 •

edited

Loading

5nord Oct 30, 2022 •

edited

Loading